Content-free Document Genre Classification using First Order Random Graphs
Authors
Abstract
We approach the general problem of machine-printed document genre classification using content-free layout structure analysis. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our approach uses attributed relational graphs (ARGs) to represent the layout structure of document instances, and first order random graphs (FORGs) to represent document genres. In this paper we develop our FORG-based genre classification method and present a comparative evaluation between our technique and a variety of statistical pattern classifiers. FORGs are capable of modeling common layout structure within a document genre and are shown to outperform traditional pattern classification techniques when fine-grained genre distinctions must be drawn.
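As a rough illustration of what an attributed relational graph over page layout might look like, the sketch below builds one from a list of rectangular regions. It is not the authors' representation: the `Region` and `ARG` classes, the `build_arg` function, and the chosen node and edge attributes (size, area, offsets, a vertical ordering flag) are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A hypothetical layout region: a named bounding box in pixels."""
    name: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class ARG:
    """Attributed relational graph: attributes on nodes and on edges."""
    nodes: dict = field(default_factory=dict)  # name -> attribute dict
    edges: dict = field(default_factory=dict)  # (name_a, name_b) -> attribute dict


def build_arg(regions):
    """Build an ARG with one node per region and one edge per region pair.

    The attributes are illustrative assumptions, not those used in the paper.
    """
    arg = ARG()
    for r in regions:
        arg.nodes[r.name] = {
            "width": r.width,
            "height": r.height,
            "area": r.width * r.height,
        }
    for i, a in enumerate(regions):
        for b in regions[i + 1:]:
            arg.edges[(a.name, b.name)] = {
                "dx": b.x - a.x,                     # horizontal offset of b from a
                "dy": b.y - a.y,                     # vertical offset of b from a
                "a_above_b": a.y + a.height <= b.y,  # a ends before b begins
            }
    return arg


# Example page with two regions (made-up coordinates).
regions = [
    Region("title", 100, 50, 400, 40),
    Region("body", 100, 120, 400, 600),
]
arg = build_arg(regions)
```

A genre-level model such as a FORG would then describe distributions over these node and edge attributes across many document instances; that step is not shown here.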
Similar resources
Fine-Grained Document Genre Classification Using First Order Random Graphs
We approach the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, as document content features are similar. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our met...
Genre Classification of Web Documents
Retrieving relevant documents over the Web is an overwhelming task when search engines return thousands of Web documents. Sifting through these documents is time-consuming and sometimes leads to an unsuccessful search. One problem is that most search engines rely on matching a query to documents based solely on topical keywords. However, many users of search engines have a particular genre in m...
Searching in document images: what does the appearance of a document tell us about what it means?
The document understanding problem can be informally defined as the automatic extraction of meaning from documents. In the Intelligent Sensory Information Systems group we have experimented with analyzing the visual appearance of documents in order to extract meaning. That is, we concentrate on how documents look, rather than on what they say. We motivate this approach with several applications...
Thesis Stereotyping the Web: Genre Classification of Web Documents
Retrieving relevant documents over the Web is a difficult task. Currently, search engines rely on keywords for matching documents to user queries. This paper explores the potential for discriminating documents based on the genre of the document. I define genre as a taxonomy that incorporates the style, form and content of a d...
Classification of document page images based on visual similarity of layout structures
Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify a document's type in the absence of domain specific models. A document type or genre can be defined by the user based primarily on layout structure. Our classification approach is based o...